Developing High-resolution Universal Multi-type n-gram Text Similarity Detector
Abstract
This paper describes the approaches used for the Plagiarism Detection task at the PAN 2014 International Competition on Uncovering Plagiarism, Authorship, and Social Software Misuse, where they took first place with a plagdet score of 0.907 on test corpus no. 3 and third place with a score of 0.868 on test corpus no. 2. In this work we consolidated the experience from PAN 2012 and PAN 2013 research [2] and further improved previously developed plagiarism detection methods [8] with the help of contextual n-grams, surrounding-context n-grams, named-entity-based n-grams, odd-even skip n-grams, functional-word-frame-based n-grams, a TF-IDF sentence-level similarity index, a noise-sensitive clustering algorithm, and focused summary-type detection heuristics. These components are combined into a single model that marks similar sections and thereby detects different types of obfuscation effectively.
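Two of the feature types named above can be illustrated with a minimal Python sketch. The abstract does not define them, so everything here is an assumption: `odd_even_skip_ngrams` reads "odd-even skip n-grams" as n-grams built from every other token, and `tfidf_vectors` uses one common smoothed IDF formula at sentence level. All function names are hypothetical and are not taken from the authors' system.

```python
import math
from collections import Counter


def word_ngrams(tokens, n=3):
    """Contiguous word n-grams, returned as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def odd_even_skip_ngrams(tokens, n=3):
    """ASSUMED reading of 'odd-even skip n-grams': take every other token,
    so each n-gram spans 2*n - 1 positions of the original sequence."""
    span = 2 * n - 1
    return {tuple(tokens[i:i + span:2]) for i in range(len(tokens) - span + 1)}


def tfidf_vectors(sentences):
    """Sentence-level TF-IDF: each tokenized sentence is treated as a document.
    The smoothed IDF below is one common variant, not the paper's formula."""
    counts = [Counter(s) for s in sentences]
    n_docs = len(counts)
    # Iterating a Counter yields each distinct term once per sentence.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


source = "the quick brown fox jumps over the lazy dog".split()
suspicious = "the quick red fox jumps over a lazy dog".split()

shared_contiguous = word_ngrams(source) & word_ngrams(suspicious)
shared_skip = odd_even_skip_ngrams(source) & odd_even_skip_ngrams(suspicious)
vec_source, vec_suspicious = tfidf_vectors([source, suspicious])
similarity = cosine(vec_source, vec_suspicious)
```

In this toy pair, two words were substituted. The skip n-grams share more matches than the contiguous trigrams because a substituted word can fall into a skipped position, which suggests why such features might help against word-replacement obfuscation.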